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Abstract 

In this paper we describe a new algorithm for buffered global routing according to a prescribed buffer 
site map. Specifically, we describe a provably good multi-commodity flow based algorithm that finds 
a global routing minimizing buffer and wire congestion subject to given constraints on routing area 
(wirelength and number of buffers) and sink delays. Our algorithm allows computing the tradeoff curve 
between routing area and wire/buffer congestion under any combination of delay and capacity constraints, 
and simultaneously performs buffer/wire sizing, as well as layer and pin assignment. Experimental results 
show that near-optimal results are obtained with a practical runtime. 

1 Introduction 

Due to delay scaling effects in deep-submicron technologies, interconnect planning and synthesis are becoming 
critical to meeting chip performance targets with reduced design turnaround time. In particular, the global 
routing phase of the design cycle is receiving renewed interest, as it must efficiently handle increasingly more 
complex constraints for increasingly larger designs (see |13| for a recent survey). In addition to handling 
traditional objectives such as congestion, wirelength and timing, a critical requirement for next generations of 
global routers is the integration with other interconnect optimizations, most importantly with buffer insertion 
and sizing. Indeed, it is estimated that top-level on-chip interconnect will require up to 10^ repeaters when 
we reach the 50nm technology node. Since these repeaters are large and have a significant impact on global 
routing congestion, buffer insertion and sizing can no longer be done after global routing completes. 

In this paper, we present and enhance a powerful integrated approach introduced in 3 for congestion 
and timing-driven global routing, buffer insertion, pin assignment, and buffer/wire sizing. Our approach is 
based on a multicommodity flow formulation for the buffered global routing problem. Multicommodity flow 
based global routing has been an active research area since the seminal work of Raghavan and Thomson 
Although the global routing problem is NP-hard (even highly restricted versions of it, see ^^), ^Hl has 
shown that the optimum solution can be approximated arbitrarily close in time polynomial in the number 
of nets and the inverse of the accuracy. To date, predictability of solution quality continues to be a distinct 
advantage of multicommodity flow based methods over all other approaches to global routing, including 
popular rip- up-and- reroute approaches |13j . 

The original method of Raghavan and Thomson relies on randomized rounding of an optimum fractional 
multicommodity flow. Subsequent works [7lll4| have improved runtime scalability by using the approximation 
algorithm for multicommodity flows by |17j. Yet, only the recent breakthrough improvements due to Garg 
and Konemann |12j and Fleischer jll) have rendered multicommodity flow based global routing practical for 
full chip designs As our algorithm is built upon the efficient multicommodity flow approximation 
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scheme of |12l Our main contribution is a provably good multi-commodity flow based algorithm that, 
for a given buffer site map, finds a buffered global routing minimizing buffer and wire congestion subject to 
given constraints on routing area (wirelength and number of buffers) and sink delays. 
The key features of our approach include the following: 

• Our implementation permits detailed floorplan evaluation in that it enables computing the tradeoff 
curve between routing area and wire/buffer congestion under any combination of delay and capacity 
constraints. 

• Like the allocation heuristic in our algorithm enforces maximum source/buffer wireloads and con- 
trols congestion by taking into account routing channel capacities and buffer site locations. At the 
same time, like the buffer-block planning algorithm in our algorithm takes into account individual 
sink delay constraints. 

• Simultaneously, our algorithm performs buffer and wire sizing by taking into account given libraries 
of buffer types and wire widths, and integrates layer and pin assignment (the latter with virtually no 
increase in runtime). Soft pin locations are modeled as multiple sites (grid locations), and are enabling 
to solution quality. 

The rest of the paper is organized as follows. In Section [21 we formalize the buffered global routing 
problem for 2-pin nets. Then, in Section 13.21 we reformulate the problem as a minimum cost integer 
multicommodity flow problem (with capacities on sets of edges), give an efficient algorithm for finding near- 
optimal solutions to the fractional relaxation, and show how to convert fractional solutions to near-optimal 
routings by randomized rounding, fn Sections sec.multipin and|5lwe show how our approach can be extended 
to handle multipin nets as well as pin assignment, polarity constraints imposed by the use of inverting buffers, 
buffer and wire sizing, and prescribed delay upperbounds. We conclude the paper with experimental results 
detailing the scalability and limitations of our algorithm and comparing it to the heuristic in 0] . 

2 Problem formulation 

In this section we formulate the buffered global routing problem. To simplify the presentation, we ignore pin 
assignment flexibility and assume that there is a single (non-inverting) buffer type and a single wire width. 
We further assume that only buffer wireload constraints must be satisfled (i.e., ignore delay upper-bounds), 
and that each net has 2 pins only. Extensions of our approach to pin assignment, polarity constraints induced 
by the use of inverting buffers, buffer and wire sizing, timing constraints, and multi-pin nets arc discussed 
in Section [S] 

For a given floorplan and tile size, we construct a vertex- and edge- weighted tile graph G = (V, E, h, w), 
6 : y ^ IN, w : £: N, where: 

• is the set of tiles; 

• E contains an edge between any two adjacent tiles; 

• For each tile v d V , the buffer capacity h(v) is the number of buffer sites located in v\ and 

• For each edge e — {u,v) G E, the wire capacity w{e) is the number of routing channels available 
between tiles u and v. 

We denote hy Af = {Ni, N2, ■ ■ ■ , Nk} the given nctlist, where each net Ni is specifled by a source and a 
sink ti. 

A feasible solution to the buffered global routing problem seeks for each net Ni an s^-t^ path Pj buffered 
using the available buffer sites (see Figure n~T|l such that the source vertex and the buffers drive each at most 
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U units of wire, where U is a given upper-bound (the example in Figure [lTI has U — 5). FormaUy, a. feasible 
buffered routing for net Ni is a path Pi — {vq^vi, . . . ,vi-) in G together with a set of buffers Bi C {wg, ...,«;.} 
such that: 

• Vq^ Si and vi. = t^; 

• w{vi-i,Vi) > 1 for every i = 1,. . . 

• H^i) ^ 1 fo'' every Vi € Bi] and 

• The length along Pi between vq and the first buffer in Bi, between consecutive buffers, and between 
the last buffer and . , are all at most U. 

We will denote by TZi the set of all feasible routings (Pi, Bi) for net Ni. Given buffered routings {Pi, Bi) E 
TZi for each net Ni, the relative buffer congestion is 

\{^--v<^ B.}\ 
a — max 

vev b{v) 

and the relative wire congestion is 

|{» : e e P.}\ 
V = max — 

eeE w{e) 

The buffered paths {Pi,Bi), i = 1,. ..,k, are simultaneously routable iff both /i < 1 and i' < 1. To leave 
resources available for subsequent optimization of critical nets and ECO routing, we will generally seek 
simultaneous buffered routings with buffer and wire congestion bounded away from 1. Using the total wire 
and buffer area as solution quality measure we get: 

Integrated Global Routing and Bounded Wireload Buffer Insertion Problem^ 
Given: 

• Grid-graph G = (V^ E, b, w), with buffer and wire capacities 6 : V ^ IN, respectively ui : — s- N; 

• Set JV = {Ni, . . . , Nk} of 2-pin nets with unassigned source and sink pins 5*^, Ti C V; and 

• Wireload, buffer congestion, and wire congestion upper-bounds U > 0, (Iq < 1, and i'q < I. 

Find: feasible buffered routings {Pi,Bi) E TZi for each net Ni with relative buffer congestion /i < /io and 
relative wire congestion v < vq, minimizing the total wire and buffer area, i.e., a^i^i \Bi\ + /3X]i=i 
where a, (3 > are given constants. 

3 Buffered Global Routing Via Multicommodity Flow Approxi- 
mation 

The high-level steps of our approach are the following: 

1. Following Alpert et al. we construct a 2-dimensional tile graph to capture the number and spatial 
distribution of wire routing tracks and buffer insertion sites available in the given floorplan. For 
simplicity, we assume throughout the paper that all tiles have the same size. As discussed in Section|3 
uneven tiling, i.e, using fine tiling in highly congested regions of the design and coarse tiling in regions 
blocked for routing or buffering, can be used to improve the tradeoff between accuracy and solution 
time. 

^Thc problem is called Floorplan Evaluation Problem in but the formulation is useful in post-placement scenarios as 
well. 
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2. We then build an auxiliary graph in which every directed path from a net source to the net's sink 
captures a feasible wire route between them together with locations for the buffers to be inserted on 
this route such that buffer load constraints are satisfied. The auxiliary graph is obtained automatically 
from the tile graph using an original gadget construction (Section 13. 1|) which only increases the size of 
the graph by a linear factor. 

3. We use the auxiliary graph to formulate the buffered global routing problem as an integer linear 
program (ILP). To formally express the ILP, we use a 0/1 variable for each source-sink path, and 
require that exactly one path be chosen for each source-sink pair. The objective is to minimize the 
wire and buffer congestion subject to a given upper-bound on the total wirelength fSection l3.1|l . 

4. We find a near-optimal solution to the fractional relaxation of the above integer program. Although the 
integer program has exponential size (there are exponentially many variables corresponding to source- 
sink paths in the auxiliary graph), we give a combinatorial algorithm which runs in polynomial time 
by representing explicitly only non-zero variables fScction I3.2|l . The algorithm combines the general 
framework for multicommodity flow approximation introduced by |12[ lllj with some of the extensions 
described in [2] and [TU] . 

5. Finally, we round the fractional solution to an integer one using a heuristically enhanced version of the 
randomized rounding method originally proposed in '16^ fSection l3.3|l . 

In this section we detail each step of our approach for the case of 2-pin nets, and also discuss efficient 
computation of the entire tradeoff curve between routing area and congestion for a given floorplan (Section 



3.1 Gadget Graph and Integer Program Formulation 

Recall that, for every feasible buffered routing in the tile graph G = {V{G), E{G),b,w), the wireload of 
the source and of each buffer must be at most U. We start by defining an auxiliary directed graph H 
which captures exactly these feasible buffered routings (see Figure 11.211 . The graph H has U + 1 vertices 
v'~' , . . . for each vertex v e V{G). The index of each copy corresponds to the remaining wireload 
budget, i.e., the number of units of wire that can still be driven by the last inserted buffer (or by the net's 
source). Buffer insertions are represented in the gadget graph by directed arcs of the form {v^,v^) — 
following such an arc resets the remaining wireload budget up to the maximum value of U. Each undirected 
edge (m, v) in the tile graph gives rise to directed arcs {u^ ,v^~^) and {v^ ,u^^^), j = 1, . . . , J7, in the gadget 
graph. Notice that the copy number decreases by 1 for each of these arcs, corresponding to a decrease of 1 
unit in the remaining wireload budget. In addition, we add to H individual vertices to represent net sources 
and sinks. Each source vertex is connected by a directed arc to the U-th copy of the node representing the 
enclosing tile. Furthermore, all copies of the nodes representing enclosing tiles are connected by directed 
arcs into the respective sink vertices. 
Formally, the graph H has vertex set 



El- 



V{H) 



{s,, t,\l<i<k}U{v^\ve V{G),0 <j<U} 



and arc set 



E{H) 



Esrc U Es^nk U ( |J Eu,v) U ( |J E^) 



iu,v)eE(G) vGV(G) 



where 



E, 
E, 



'V 



'src 



{{si, v^) I tile V contains s^, 1 < « < fc} 
{{v^ ,ti) I tile V contains ii, < j < [/, 1 < i < fc} 
{(u^z;^-1),(^;^u^-1)|1<j <[/} 
{(z;^^;^)|0<j < U} 
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Each directed path in the gadget graph H corresponds to a buffered routing in the tile graph, obtained 
by ignoring copy indices for tile vertices and replacing each "buffer" arc {v^ , v^) with a buffer inserted in tile 
V. Clearly, the construction ensures that the wireload of each buffer is at most U since a directed path in H 
can visit at most U vertices before following a buffer arc. Therefore, we get: 

Lemma 1.1 There is a 1-to-l correspondence between the feasible buffered routings for net Ni in the tile 
graph G and the Si-ti paths in H. 

We will use the correspondence established in Lemma fl.ll to give an integer linear program (ILP) for- 
mulation for the buffered global routing problem. Let Vi denote the set of all simple Si-ti paths in H. We 
introduce a 0/1 variable Xp for every path p g P := iiiVi- The variable Xp is set to 1 if the buffered routing 
corresponding to p G Pi is used to connect net Ni, and to otherwise. With this notation, the buffered 
global routing problem can be formulated as follows: 

min^ (a^ |pni;„| + /3 ^ |p n (LI) 

pfz-p veV{G) {u,v)eE(G) 

subject to 

E \pnE,\xp< fiob{v), v€V{G) 
peV 

J2 \pn Eu,v\xp < iyow{u,v), {u,v) € E{G) 
peV 

^2:^ = 1, i = 1, . . . , k 

peV, 

Xpe{0,l}, peV 

ILP Hl.l|l is similar to the "path" formulation of the classical minimum cost integer multicommodity flow 
problem pp. The only difference is that capacity constraints on the edges and vertices of the tile graph G 
become capacity constraints for sets of edges of the gadget graph H (see Figure 11.211 . We note that the 
buffered global routing problem can be represented more compactly by using a polynomial number of edge- 
flow variables instead of the exponential number of path-flow variables Xp. However, wc use formulation Hl.l|) 
since it is the natural setting for describing the approximation algorithm in next section. The exponential 
number of variables is not impeding the efficiency of the approximation algorithm, which, during its execution, 
represents explicitly only a polynomial number of paths with non-zero flow. 

3.2 The approximation algorithm 

In this section we give an efficient approximation algorithm that can be used for solving the fractional 
relaxation of ILP (|l.l|l . Using an approach similar to that used in 12 for solving the minimum cost 
concurrent multicommodity flow problem (see also [2]), instead of solving the relaxation of ILP Hl.ll) directly 
we introduce an upper bound D on the wire and buffer area and consider the following linear program (LP): 

minA (1.2) 

subject to 

E (a E \pnE,\+f3 E \pnEu,v\)xp< X D 

p^J>^veV(G) {u,v)eE{G) ' 

E \pr\E,\xp<\p.Qb{v), weF(G) 
peV 

E |pn Xp < A i/Q t«(u, w), {u,v) e E{G) 
peV 

E '^p ~ i = 1, . . . , k 

peV, 

xp>0, peV 
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Let A* be the optimum objective value for LP l|1.2|l . Solving the fractional relaxation of ILP Hl.l|l is 
equivalent to finding the minimum D for which A* < 1. This can be done by a binary search which requires 
solving the LP H1.2|l for each probed value of D. A lower bound on the optimal value of D can be derived 
by ignoring all buffer and wire capacity constraints, i.e., by computing for each net Ni buffered paths p € Vi 
minimizing a ^ \p H Ey\ + P ^ \p D E^^vl- A trivial upper bound is the total routing area available, 

veV{G) {u,v)eE{G) 

i.e., -Dmax = Mo X) ^i"*^) ~^ P^o w{u,v). In particular, unfeasibility in the fractional relaxation of 

veV{G) (u,v)eE(G) 

ILP is equivalent to A* being greater than 1 when D = Umax: and can therefore be detected using the 
algorithm described below. 

The algorithm for approximating the optimum solution to LP (|1.2|l (see Figure uses the general 
framework for multicommodity flow approximation introduced in jl2| combined with ideas similar to those 
in |10| for efficiently handling set capacity constraints. The algorithm relies on simultaneously approximating 
the dual UP: 

k 

max li (1-3) 

i=l 

subject to 

v&V{G) (u,v)GE{G) 

J2 \pnE^\{y^ + au)+ E \p<^Eu ,v \ (^^U,V + f3u) >k, p eVi 
veV{G) {u,v)eE{G) 

Vv>0, v&V{G) 
Ze > 0, e G E{G) 

The algorithm starts with trivial solutions for LPs H1.2|) and and then updates these solutions over 

several phases. In each phase (lines 5-16 in Figure one unit of flow is routed for each commodity; a 
feasible solution to LP H1.2|l is obtained in line 17 after dividing all path flows by the number of phases. 
Commodities are routed along paths with minimum weight with respect to weights of + au for arcs in 
Ey, V & V{G), of Zu,v + I3u for arcs in Eu^v, (m, v) G E{G), and of for all the other arcs (cf. LP ^l.'d\ ). The 
dual variables are increased by a multiplicative factor for all vertices/edges on a routed path; this ensures 
that dual weights increase exponentially with usage and thus often used edges are subsequently avoided |12j . 

Minimum- weight paths are computed in line (11) of the algorithm using Dijkstra's single-source shortest 
path algorithm. To reduce the number of shortest path computations, paths are recomputed only when their 
weight increases by a factor of more than (1 -t- 75) (see the test in line 9). This speed-up idea, first applied 
in |11| for the maximum multicommodity flow problem, has been shown in 2| to decrease the running time 
in practice while maintaining the same theoretical worst-case runtime. 

Theorem 1.1 The algorithm in Figure 031 finds an (1 + Eq)- approximation with O(p^fclogn) minimum- 
weight path computations, using e^min | - , -(\/l + £0— 1), \ )^^^) } Q^*^ 6— ( ^^^, )^^"; where n and 
m are the number of vertices and edges of G, and e' := e(l + £)(1 + £7). 

Proof: Let t be the total number of phases executed by the algorithm. We will prove that if the algorithm 
had stopped one phase before the last one, namely after i — 1 phases, the solution would have had the desired 
approximation ratio. 

Let yi^'^\ Zu,'v^ and u'^'''*) be the value of variables yv, z„^„ and u after net i has been considered in phase 
r and the variables have been increased in line 14, and let yi^^ :— y^v'^\ Zul :— Zu,'v^ and u^'"-' u^"^'^^ be 
the value at the end of phase r. At the beginning we have 

veV(G) iu,v)eE{G) veV{G) («,«)G-B(G) 

(1.4) 
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When the dual variables yy, Zu,v and u are increased in line 13 after a path pi G Vi has been found, we have 
ev 

,0 E b(v)y^--'^ (l + + .0 E Mu, v)^-'' (l + e^^S^) 



v&V{G) (u,v)eE{G) 

E biv)yi'-^'-'Ul + 

v£V{G) ^ . •■ ^ , , (u,v)£E(G} 



( a E \v^r\Ey\ + /? E Ik n 

■uey(G) (u,v)^E{G) 

1 + e 



V 



D 



i>Gl^(G) (jj,ii)e-E(G) 

\iiGV(G) (u,v)(iE[G) J 

In the following phases a path for net i is only computed if the cost of the last path found has increased 
by more than (1 + ^e). Since the dual variables are increased only during the algorithm, we always have at 
the beginning of line 13: 

veV(G) (u,v)eE(G) 

< (1 + je) [ min ^ |p n (y^'') + au^'')) + J] |p n (z.^ + 

which means that at the end of phase r we have: 

ijevCG) (n,D)e£;(G) 
< Mo E b{v)y^J-^'> + E 

iiGVCG) {u,v)eE{G) 



-£(l + 7e) V min f E IP.nS^Ky^' 



E \P:riEu,y\{z^u,l+Pu^'-'>)\ (1.5) 
(u,i;)e£;(G) / 



For brevity, we set e' := £(1 + je). By linear programming duality, the expression 



E min ( E b^n£;,|(y.[,"^+auM)+ E Ift ^ (^S + 



i=ipeVi \vev{G) {u,v)eE{G) 



Mo E &(^^)2/'l''^ + 1^0 E ■w{u,v)z^}, + Du'-'^^ 

veViG) {u,v)eE(G) 

(r) 

is a lower bound on the maximum relative congestion, that is A^^^ < A* . 
With this, inequality p.5|l can be rewritten as 

Mo E K'")yv''^ + '^o E 'w{u,v)z'i!'}v + Du'^'~'> 

veV{G) {u,v)eE{G) 

< MO E b{v)yi''~'^ + 1^0 E wiu,v)z^J,^''^ +Du<^--'^ 

veV(G} {u,v)eE{G) 

+e'A|n^o E b{v)yi''^ + 1^0 E w{u,v)zi',l + Du^^^ 

\ veV{G) (u,v)eE{G) . 
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which can be transformed to 



vGViG) {u,v)eE{G) 

1 - e'X\l^' \ v£V{G) {u,v)£E(G) ^ 

If we set Xib := max A[r\ we get with equation (|1.4|l : 

(n + m + 



< 



< 



(1 - e'Xib) V 1 - e'A 
(n + TO + / e'A;6 



(1 - e') V 1 - e' 

(n + m + l)6 ^'^'"(^-1) 

^ ^ P 1— e' 

(l-eO 



(1.6) 



For the last inequahty, 1 + x < for a; > is used. 

An upper bound on A for the solution computed by the algorithm can now be derived: We will demonstrate 
this for the second constraint in p.2|l . for the first and third constraint the same upper bound can be obtained 
similarly. 

Suppose node v is used s times by some path during the first t—1 phases, and let the jth increment, by 
which the constraint ^^If^y^ Spep \pf^Ey\xp < A is incremented, be aj := ^^^(t,) IPiH-E^I for the appropriate i. 

,(0) - s 



After rescaling the variables Xp, p e V, we have ^J^^^ J^peVi IP^^^^l^p = /(*-!)• Since yi, - ^^^^^^^ 



and j/e* < — rr^ (because the condition in line 4 still holds before the last phase is executed) and since 



fJ-oKv) j=l 



we get 

S « 1 
Mofe(w)/=i fioKv) 
Since (1 + e)" < 1 + ea for < a < 1 (the expression (1 + e)" is convex in a, 1 + ea is linear and we have 
equality for a = and a = 1), it follows 

and hence we get 

Mo^(w) ^ t-1 t-1 

As we can derive the same bounds for the first and second constraint, we have 

X (t-1) < logi+e (?) J. 
- t-1 ' ^ ' ' 

Since fxo b{v)yv''^ +vo Yl, w{u, v)zu,v + Du^^'^ > 1, solving inequality H1.6() with r = t for A/b gives 

vGV(G) (u,v)£E(G) 

a lower bound on the optimum solution value: 
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from which together with 1)1. 7|l we get an upper bound on the approximation ratio p: 

P < < 



A/, 



i-g' In ( ^ 

E'(t-l) V (n+m+l)5 y 



\^ {n+m+l)<5 J 



1 

If (5 is now chosen to he S :— { , ) , we get 

ln(|) 



such that 



In 

P < 
< 



If e < i , we have 1 + 76 < 2 and we get 



(n+m+l)5 y 



(l-£')^ln(l + e) 
£(1 + 7£) 

(l-£(l + 7e)f (e-f) 

1 +7£ 

(l-£(l + 7e)f (1-1) 



l+7£ 

P < 



(1 - 2e)3 

If e is chosen such that 1 + 72 < VT~r^ and < Vl + £0, so we choose 



e = mm 



we get p < 1 + £o- After all, we have e — O(eo) and also e' — O{eo)- 
From H1.7|l we get that 



A* < 



t-1 ' 

which means that the maximum number of phases is bounded by: 

t < i + i^siiili) 

A* 

1 / n + m + 1 

= 1 + In 



ln(l + e)£'A* V 1 - £' 



The last expression can be bounded by O i^-^Tjr In nj . □ 
Remark 1. The dependence on A* in Theorem 11.11 can be eliminated by a scaling technique described 
in |11| . Thus, using a Fibonacci heap implementation of Dijkstra's algorithm to compute minimum- weight 
paths leads to a runtime of 0(^k(m + nlogn) logn) for the algorithm in Figure [T3I 

Remark 2. Using ideas from 0, it can be shown that the algorithm in Figure lTTSl not only minimizes A, but 
also "strives" for a lexicographically minimum solution with respect to the vector consisting of the relative 
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buffer congestion of the vertices, the relative wire congestion of the edges, and the ratio between the total 
routing area and the upperbound D. This is particularly useful for the case when the floorplan is unroutable 
using the given buffer sites and wire tracks, since then the solution returned by the algorithm indicates where 
we should add more routing resources (by local perturbations of the floorplan) to reach routability. As noted 
above, this case is handled by running the algorithm with D = Dmax- If we want to completely ignore the 
constraint on total routing area (i.e., set D = oo), the dual variable w is throughout the whole execution 
of the algorithm and can thus be eliminated. 

Remark 3. In a practical implementation, line (2) of the algorithm, which requires setting to zero an 
exponential number of variables, is not implemented explicitly. Rather, the algorithm keeps track only of 
the paths with non-zero flow, i.e., those paths for which flow is augmented in line (13). Several alternatives 
for memory efhcient representation of the non-zero flows are discussed in next section. 

3.3 Randomized Rounding 

In the previous section we have presented an algorithm for computing near-optimal solutions to LP 11.21 
The last step in solving the buffered global routing problem is to convert these fractional flows into feasible 
buffered routings for each net. We follow the randomized rounding technique proposed by Raghavan and 
Thomson ^B], and route each net Ni by randomly choosing one of the paths p E Vi, where the probability of 
choosing path p is equal to the fractional flow Xp (recall that ^ Xp = 1, i.e., Xp is a probability distribution 

over Vi). Since the fractional flows satisfy buffer and wire congestion constraints, it follows that (for large 
enough capacities) the relative congestion after rounding increases only by a small factor jl6j . 

A direct implementation of the randomized rounding scheme requires storing explicitly all paths with 
non-zero flow. However, this is typically unfeasible due to the large memory requirement. An alternative 
implementation, originally suggested by '16', is to compute edge flows instead of path flows during the 
algorithm in Figure ^31 and then implement randomized rounding by performing a random walk between 
the source and sink of each net. As noted in |10| . performing the random walks backwards, i.e., from sinks 
towards sources, leads to reduced congestion for the case when a significant number of the 2-pin nets result 
from decomposition of multi-pin nets. 

A simpler implementation, which requires storing a single path per net, is to interleave randomized 
rounding with the computation of the fractional flows Xp. In this implementation, we continuously update 
the path selected for each net, as follows. In first phase, the single path routed for each net becomes the 
net's choice with probability 1. In iteration r > 1, the path routed for net i replaces the previous selection 
of net i with a probability of (r — l)/r. It is easy to see that the path selected after t phases was selected by 
the net in phase r — 1, . . . , i with an equal probability of l/r, i.e., the probability that a path p is the final 
selection is equal to the fractional flow Xp computed by the algorithm in Figure 

The results reported in Sectiondwere obtained using yet another implementation. In our implementation 
we save the paths routed for each net in the last K = 5 phases of the algorithm in Figure lOl fnote that the 
K paths resulting for each net need not be distinct). Then, we pick for each net one of the K saved paths, 
uniformly at random. To further improve the results, we repeat the random choices a large number of times 
(10,000 times in our implementation) and keep the choices that result in the smallest congestion or routing 
area (depending on the optimization criteria). We found this scheme, which has still reasonable memory 
requirements, to work better in practice than the other approaches (although, technically, it implements only 
a rough approximation of the probability distribution required by |lf)j). 

3.4 Area and congestion tradeoff curve 

To evaluate a floorplan at an early stage of the design process, it is useful to find not only the minimum 
routing area needed for given bounds on /io and vq on the relative buffer and wire congestion, but also 
how the total routing area increases if we enforce a smaller congestion. Obviously, a floorplan is better if a 
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smaller area increase is needed for the same decrease in congestion. Let us denote by A(/i, ly) the minimum 
routing area needed for a fractional solution with relative buffer and wire congestion not more than fi and ly, 
respectively. In the following, we denote a fractional solution Xp, p for LP (|1.2|) . simply by a vector x. 
Let A{x)^ and v{x) denote the total routing area, buffer congestion, respectively wire congestion of x. 

Lemma 1.2 The function (/i, v) i — > A(/i. u) is convex. 

Proof: Let x be an optimal solution (minimal routing area) for given relative congestions fi and ly, that 
is A{x) — A(/i, v), fj.{x) < II, iy{x) < u, and, similarly, let x' be an optimal solution for relative congestions 

and v' . For any given k G [0,1], we will now construct a solution x" with maximum relative buffer 
congestions at most ^i" = k/^ + (1 — k)^', wire congestion at most v" = kv + {\ — , and area A{x") = 
kA{x') + (1 - k)A{x'), and hence A(p", v") < kA(/x, ly) + {1 - k)A(/i', j/'). 

The solution x" :— kx + {1 — k)x' fulfills all these requirements. It describes a feasible routing: Each 
net i is routed exactly once, because J2 — S {i^^p + (1 ^ = J2 Xp + {1 — k) ^ Xp = 

peVi peVi peVi peVi 

k1 + (1 — k)1 = 1, and we have Xp = nxp + (1 — Kjx'p > for each p £ V. For the relative congestion we 
get: pix")= max \pnEy\x'^^ max E \p n Ey\ {KXp + {1 ~ k)x' ) < k max i E 

veV(G) ^ ' pg-p veV(G) ^ ' pg-p v£V(G) ^ ' pg-p 

Ev \ Xp + {1 — k) max t-t-^ ^ \p O Ey \ x' < Kfi + {1 — k)p' = /i", and similarly v{x") < v" . Since the area 

is a linear function in x, we get also ^(a:") — kA{x) + (1 — k)A{x'), which concludes the proof. □ 

The following lemma shows that in certain cases we can derive a value A(/i, v) from an optimal solution 
of the linear program H1.2|l . and thus the binary search suggested in Section can be avoided. 

Lemma 1.3 Let x be an optimal solution of LP for given D, fiQ and vq. If there exists a solution x' 
™t/,„iax(^,i4|:)) <max(iigl,i^), then K{p{x),^{x)) = A{x) 

Proof: Suppose there exists a solution x" with A{x") < A{x) and p{x") < h'{x") < iy{x). Then the 

convex combination x'" := x" + e{x' — x"), e > 0, e small, gives a better solution for LP (|1.2II . □ 

This suggests the following approach to computing the full area vs. congestion tradeoff curve. First, 
compute the feasible region (which is also convex) for /i and v by ignoring the constraint on the area. Then 
solve the linear program LP H1.2|l for certain values of Z), /ip and vq. If the solution is on the boundary of 
the feasible region, decrease D such that /z and ly increase, otherwise a new point for the area and congestion 
tradeoff curve has been found. 



4 Handling Multipin Nets 

Although a majority of nets have only two pins, modern designs also contain an increasing number of multipin 
nets. In order to apply our approximation algorithm from the previous section, we need to reduce multipin 
nets to 2-pin nets or otherwise adjust the algorithm to handle mutipin nets. In this section we consider 
several methods for decomposing multipin nets and present an extension of our algorithm to 3-pin nets; 
these alternatives provide a range of trade-offs between runtime and solution quality. 

The standard reduction constructs the minimum spanning tree T over all k pins of a /c pin net and 
then splits the net into fc — 1 2-pin nets each corresponding to a single edge in T. Although the wireload 
can be accurately taken in account, the inherent drawback of this approach is that for high fanout nets 
we may end up with unbalanced and overloaded buffering. Note that the star topology decomposition, in 
which the source is connected with each sink by a separate edge, suffers from the same drawback. Instead 
of spanning trees, we suggest to use buffered Steiner trees. The minimum buffered routing for the entire 
k-p'm net can be found using one of the algorithms from j^. Such routing has been shown to be very 
close to optimal and is convenient for handling high fanout nets - both sink and wire loads are taken into 
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account. The resulting buffered routing tree T connects the vertices of degree 1 (corresponding to terminals), 
degree 2 (corresponding to buffers) and degree 3 and 4 (corresponding to branching points for Steiner tree 
routing) (see Figure iL^ a)). Our approach is to split T into smaller pieces (2- or 3-pin nets) and route them 
separately using the algorithm from the previous section. The resulting pieces have longer total wirelength, 
thus allowing more flexibility for congestion minimization. We distinguish three methods of splitting T. 

Fixed branching and fixed buflfering. The tree T is split into 2-vertex subgraphs by replacing each 
vertex v with the deg{v) copies each corresponding to one of the incident edges (see Figure ^3Ib)). Each 
subgraph corresponds to a single edge of T and has a single source and a single sink. The number of resulting 
2-pin nets is k + p + b — 1, where k is the number of terminals, p is the number of Steiner points and b 
is the number of buffers in T. This decomposition is the least flexible - the positions of both buffers and 
Steiner points are fixed. Routing- and buffer-congestion minimization may require using very long detours 
with possible addition of extra buffers. 

Fixed branching and flexible buffering. The tree T is split into 2-vertex subgraphs by replacing each 
branching vertex v with the deg(y) copies, each corresponding to one of the incident edges (see Fieure lL^ c')'). 
Each subgraph corresponds to a single edge of the unbuffered version of T . The number of resulting 2-pin 
nets is k-\-p—l. Similar to the previous decomposition, there is limited room for rerouting but now the buffers 
can be shifted between grid cells resulting in much better opportunities for buffer-congestion minimization. 
Since the buffer insertion may be caused by large sink load (e.g., in Figure n~W c) net (s,pi) requires a buffer 
because of the 3 sinks downstream), it is necessary to compensate the upper bound U for some nets. This 
can be easily done by decreasing the level of the source of such net as follows. Normally, the source of a net 
is placed at the highest node of the edge gadget, e.g., on Figure W?R but if compensation is necessary, 
then the source should be placed lower, e.g., it^, thus requiring a buffer much sooner. 

Half-flexible branching and flexible buffering. Assume for simplicity that all Steiner points in T have 
degree 3 i.e., no degree-4 Steiner points are allowed. The vertices of the unbuffered tree T can be colored 
into 2 colors such that adjacent vertices have different color. If we fix locations of the Steiner points of the 
same color then all the resulting nets will have either 3 or 2 terminals (see Figure ^3Id)). If we pick the 
color with the least amount of Steiner points, then we can be sure that at most half of all Steiner points are 
fixed implying that the number of resulting nets is at most k — \. Indeed, at least p/2 Steiner points are 
not fixed and the corresponding three 2-pin nets for fixed branching are replaced with one 3-pin net. This 
reduces the total number of nets with respect to fixed branching by at least 2p/2 = p. 

This decomposition is the most flexible one, giving most opportunities for routing and buffer-congestion 
minimization. Unfortunately, it requires runtime-costly adjustments to the approximation algorithm from the 
previous section. Figure [T31 gives the modified subroutine for computing feasible routings having minimum 
weight with respect to the dual variables. We assume here that the possible locations of the source pin for a 
net Ni are specified by Si as before, while the two sinks are specified by sets and Tf. In the graph H we 
have vertices t\ and and edge sets {(w-*, 4) 1^ ^ ^iiJ = ^i---^U}, I — 1,2 for the sink pins of such a 3-pin 
net. For each possible Steiner point v, the algorithm tries all possible lengths j and k to the first buffer on 
the path from v to t\ and respectively to if. 

5 Extensions 

In this section we describe how our approach to buffered global routing can be extended to handle pin 
assignment, polarity constraints imposed by the use of inverting buffers, buffer and wire sizing and prescribed 
delay upperbounds. The modifications required to handle these extensions involve only changes to the gadget 
graph described in Section 13.11 but not to the approximation algorithm in Figure 11.31 or to the randomized 
rounding scheme. 
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5.1 Pin Assignment 

At the early stages of the design flow there is a significant degree of flexibiUty available for pin assignment, 
and therefore the ability to exploit this flexibility can have a major impact on the quality of resulting global 
routings. Consideration of pin assignment requires only two small changes in the construction of the gadget 
graph described in Section ITTl (1) source vertices Si must now be connected by directed arcs to the U-th 
copies of all nodes representing enclosing tiles, and (2) copies 0, . . . ,U oi all nodes representing enclosing 
tiles must be connected by directed arcs into the sink vertices ti. Reading pin assignments from the paths 
selected by randomized rounding is trivial: we assign to each source (sink) an arbitrary pin in the tile visited 
first (last) by the selected path for the net. 

The size of the gadget graph remains virtually unaffected by the pin assignment modification: for k nets 
we only add 0{k) edges to the gadget graph under the realistic assumption that each pin can be assigned 
to at most 0(1) tiles. Therefore, the time required to find minimum-weight paths, and hence the overall 
runtime of the algorithm in Figure [T31 does not increase even though the number of paths available for each 
net increases when considering pin assignment. 

5.2 Polarity Constraints 

The basic problem formulation in Section [21 considers only a non-inverting buffer type. In practice, inverting 
buffers are often preferred since they occupy a smaller area for the same driving strength. Although the use 
of inverting buffers introduces additional polarity constraints, which may require a larger number of buffers 
to be inserted, overall inverting buffers may lead to a better overall resource utilization. Algorithms for 
bounded capacitive load inverting (and non- inverting) buffer insertion have been recently discussed in 0. 
The focus of [Hj is on single net buffering, with arbitrary positions for the buffers. In our approach, the goal 
is to minimize the overall number of buffers required by the nets, and buffers can be inserted only in the 
available sites. 

Consideration of polarity constraints require is achieved by modifying the basic gadget graph given in 
Section ITTI as follows (see Figure Each node of the basic gadget is replaced by an "even" and "odd" 

copy, i.e., is propagated into Weuen ^'^^ v^^^. Tile-to-tile arcs are replaced by two arcs connecting copies 
with the same polarity, e.g., the arc (u% w'~^) gives rise to {u\,^^^, v]^ln) ^-^id (^odd' "^odd)- ^ path uses such 
an arc, then it does not change polarity. Instead, each buffer arc changes polarity, i.e., {v^,v^) gives rise to 
(Wg„g„, w^^) and (Wod^jW^en)- The gadget also allows two inverting buffers to be inserted in the same tile 
for the purpose of meeting polarity constraints. This is achieved by providing bidirectional arcs connecting 
the U-ih even and odd copies of a tile w, i.e., {u^vem^odd) ^^'^ ("^odd' ^even) ■ Finally, source vertices Si are 
connected by directed arcs to the even U-th copy of enclosing tiles, and only copies of the desired polarity 
have arcs going into sink vertices ti. 

5.3 Buffer and wire sizing 

Buffer and wire sizing are well-known techniques for timing optimization in the final stages of the design 
cycle in). However, early buffer and wire sizing can be equally effective for reducing congestion and/or wiring 
resources. In this section we show how to incorporate buffer and wire sizing during in our algorithmic frame- 
work for global buffered routing. The key enablers to these extensions are again appropriate modifications 
of the gadget graph. 

The gadget for buffer sizing is illustrated in Figure [TTTT al for two available buffer sizes, one with wireload 
upperbound J7 = 4 and one with wireload upperbound U = 2. The general construction entails using a 
number of copies of each tile vertex equal to the maximum buffer load upperbound U . For every buffer 
with wireload upperbound of U' < U, we insert buffer arcs {v'^,v^ ) for every < i < U' . Thus, the copy 
number of each vertex continues to capture the remaining wireload budget, which ensures the correctness of 
the construction. 
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Wire sizing (and a coarse form of layer assignment) can be handled by a different modification of the 
gadget graph (see Figure lTTTT b)). Assuming that per unit capacitances of the thinner wire widths arc rounded 
to integer multiples of the "standard" per unit capacitance, the gadget models the use of thinner segments 
of wire by providing tile-to-tile arcs which decrease the tile copy index (i.e., remaining wireload budget) 
by more than one unit. For example, in Figure [rTTT b). solid arcs {u^,v^~^) and {v'^,u^~^) correspond to 
standard width connections between tiles u and v, while dashed arcs (m*,w'~^) and (w*,u*^^) correspond to 
"half-width" connections, i.e., connections using wire with double capacitive load per unit. 

While our models for buffer and wire sizing are rather coarse (e.g., we truncate all buffer wireload 
upperbounds to integer multiples of the tile dimension, ignore variations in input capacitances of buffers and 
sinks, etc.), we consider them to be sufficiently accurate first-approximations for driving these optimizations 
during the early physical design stages. 

5.4 Delay Constraints 

In we have proposed a method for enforcing given sink delay constraints based on charging wiresegment 
delays to buffer arcs in the gadget graph, and using a routine for computing minimum-weight delay con- 
strained paths instead of Dijkstra's algorithm in the algorithm for approximating the fractional solution to 
ILP 1)1. 1() . Here we give a different method for handling sink delay constraints. The new method is similar 
in spirit to the constructions in previous sections, relying exclusively on a modification of the gadget graph. 
In general, our construction applies for any delay model, such as the Elmore delay model, for which (1) the 
delay of a buffered path is the sum of the delays of the path segments separated by the buffers, and (2) 
the delay of each segment depends only on segment length and buffer parameters. However, for the sake of 
efficiency, segment delays would have to be rounded to relatively coarse units. 

Figure lO^ shows the gadget construction for the case when delay is measured simply by the number of 
inserted buffers. The idea is again to replicate the basic gadget construction, this time a number of times 
equal to the maximum allowed net delay. Within each replica, tile-to-tile arcs decrease remaining wireload 
budget by one unit. In order to keep track of path delays, buffer arcs advance over a number of gadget replicas 
equal to the delay of the wiresegment ended by the respective buffer (this delay can be easily determined 
for each buffer arc since the tail of the arc fully identifies the length of the wiresegment). The construction 
is completed by connecting net sources to the vertices with maximum remaining wireload budget in the "0 
delay" replica of the gadget graph, and adding arcs into the sinks from all vertices in replicas corresponding 
to delays smaller than the given delay upperbounds. 

Remark. An interesting feature of the resulting gadget graph is that it is acyclic. Hence, we can now 
compute minimum- weight paths in the approximation algorithm in Figure ^31 by computing the distances 
from the source via a topological traversal of the graph in 0{m + n) time instead of the 0{m + n log n) time 
needed by Dijkstra's algorithm. 

6 Practical enhancements 

In this section we point to several enhancements which can improve the overall speed and accuracy of our 
algorithm. 

Uneven tile size. Increasing the accuracy of our method can be achieved by decreasing the tile size. 
However, this results in significant increases in running time. Furthermore, when the tile size decreases 
beyond a certain point, the channel widths and the number of buffer sites per tile may become so small 
that the accuracy of the randomized rounding is greatly reduced. Ideally, the channel widths and buffer 
sites per tile should be approximately the same for all tiles. Indeed, if a tile is too crowded, then we can 
miss potential congestion violations, and if a tile is too sparse then the solution of LP relaxation cannot be 
rounded accurately. Our approach is to use uneven tile sizes in order to achieve evenly populated tiles. This 
approach can be implemented by choosing appropriate target values for channel width and buffer sites per 
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tile and, starting with a coarse grid, recursively partitioning the overpopulated tiles into 4 equal sub-tiles 
until the target tile occupancy is reached. 

Window buffer constraints. Our model assumes that the block design reserves room for buffer sites 
such that each block tile is assigned a specific number of buffer sites. Our approach can handle not only 
constraints on the number of buffer sites in each tile but additional constraints on the total number of buffers 
in a set of tiles (i.e., windows). For instance, these additional constraints may explicitly bound total number 
of buffers in a given block. 

Improved dual-update rules. The algorithm in Fierure [Lo uses a multiplicative update rule for the dual 
variables: In each phase the dual variable corresponding to a set of edges E' is multiplied by a factor of 
(l+ex), where x is the ratio between the flow increase through E' and the capacity of E' . This is not the only 
update rule that guarantees convergence. Another example is to update the dual corresponding to E' by e^^. 
In 12] it is experimentally found that, for a similar multicommodity flow algorithm, this exponential update 
rule is more robust, i.e., guarantees convergence over a wider range of values for e. Further improvements in 
runtime and solution quality can be achieved by computing an optimum update factor in each phase using 
Newton's method (see for details). 

7 Experimental results 

In this section we report results for a 2-pin net implementation of our multicommodity flow based algorithm. 
All experiments with our algorithm were conducted on a 360 MHz SUN Ultra 60 workstation with 2 GB 
of memory, running under SunOS 5.7. The algorithm was coded in C and compiled using g++ 3.2 with 
-04 optimization. Unless otherwise noted, the value of precision parameter e in the multicommodity flow 
algorithm was set to 0.3, and the number of iterations was limited to 64. 

We tested the algorithm on the 10 circuits from j^, which are derived from testcases first used by [H]. We 
used the same circuit parameters as in the experiments for 2-pin nets of 0j ; these parameters are summarized 
in Table n~n As in |3] and |H| , we decomposed multipin nets into 2-pin nets by making direct connections from 
the source of a net to each of the net's sinks. Therefore, the numbers of 2-pin nets in Table [TTTl correspond 
to the numbers of sinks in 0] . We note that these numbers are slightly smaller than those reported in fS| 
since 3j retained only the nets requiring buffer insertion under the algorithm of We also note that the 
numbers of buffer sites used in our experiments are those used by Alpert et al. ^ in the experiments in which 
multi-pin nets were decomposed into 2-pin nets. These numbers were obtained directly from the authors, as 
they do not appear explicitly in 1^ (the numbers of buffer sites given in Table 1 of ^ were used only in their 
experiments with un-decomposed multi-pin nets; see |S] for more details on these experiments). Finally, we 
note that although we use the same grid sizes as there are some small differences in tile areas between 
Table [TTTI and jjj. These differences, which are probably due to the different procedures used in rounding 
tile dimensions, are unlikely to affect to a measurable degree the results of the compared algorithms. 

Table [L^ shows the results of the multicommodity flow algorithm (with pin assignment) when run with 
D — GO, i.e., when the objective is to minimize the wire and buffer congestion only. The table shows that 
progressively better fractional solutions are obtained by the approximation algorithm. The results also show 
the tradeoff between congestion on one hand and wiring resources (number of buffers and wirelength) on the 
other hand. 

Table [T3I gives the results for wirelength minimization (i.e., using a ~ and /3 = 1) subject to wire 
and buffer congestion constraints (/ig = 1.0 and = 1.0). In these experiments the multicommodity flow 
algorithm is run once per testcase (without binary search), with D equal to the lower bound computed 
by routing each net optimally without taking into account capacity constraints. The multicommodity flow 
runtime includes randomized rounding (10000 trials, as described in Section l3.3|l . RABID runtime is for an 
RS6000/595 workstation with 1Gb of memory, as reported in 

The wirelength of the global routing obtained by our algorithm without pin assignment (MCF) is always 
within 1.03% of the lower bound. In contrast, the RABID heuristic of 0] exceeds the lower bound by 
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5.64 — 11.87%. To evaluate the effect of simultaneous pin assignment, we have added the possibihty for 
each sink to be positioned not only in the given tile, but also in the 3-8 surrounding tiles (see Table 1 for 
the average number of tiles per pin of each testcase). Running our algorithm with pin assignment enabled 
(MCF+PA) further decreases wirelength by « 10%, while being within at most 0.15% of the corresponding 
lower bound. We note that routing and pin assignment is performed by our algorithm in virtually the same 
time as routing alone. 

Tables 11.41 gives results for the extension of the multicommodity flow algorithm to inverting buffer in- 
sertion, which is about twice slower than non-inverting buffer insertion due to the doubling in size of the 
gadget graph. Inverter insertion leads to a very small increase in the number of buffers (due to the need to 
satisfy polarity constraints) which is easily compensated by the smaller size of inverters. At the same time, 
inverter insertion requires virtually the same wirelength and often gives improved congestion (except for the 
xerox testcase) . 

Table [T31 gives runtime scaling results for the extension of the multicommodity flow algorithm to buffered 
global routing with delay constraints. We note that the algorithm becomes faster for very tight delay 
constraints, since the number of nets that can meet delay constraints is only a fraction of the total number 
of nets. For moderately tight delay constraints almost all nets become routable, yet the runtime is comparable 
to that of the unconstrained version of the algorithm. For very lax delay constraints all nets become routable, 
and the runtime becomes significantly higher than that of the delay-oblivious version of the algorithm, by 
a factor roughly proportional to the increase in the size of the gadget graph, i.e., the maximum delay 
upperbound. However, large delay constraints are not very useful since they are satisfied almost in totality 
by using the unconstrained version of the algorithm. 

8 Conclusions 

In this paper we have presented the first provably good approach to buffered global routing with simultane- 
ous timing- and congestion-driven buffered global route planning, pin and layer assignment, and wire/buffer 
sizing. The experimental results show that our method significantly outperforms approaches based on cas- 
cading individual optimizations such as the recent RABID algorithm of Alpert et al. |4|. Future work aims 
to incorporate in our implementation practical improvements such as the use of uneven sized tiles, window 
constraints on buffer usage (as opposed to tile constraints), and faster-converging dual- update rules. 

Acknowledgments 

This work has been supported in part by Cadence Design Systems, Inc., the MARCO Gigascale Silicon 
Research Center, and NSF Grant CCR-9988331. Preliminary versions of these results have appeared in 
[HlEl- The authors wish to thank Charles Alpert, Jason Cong, and Jiang Hu for kindly providing us with 
the testcases used in |5] and 0]. 

References 

[1] Ahuja, R. K., T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications, Prentice- 
Hall, Englewood Cliflts, NJ, 1993. 

[2] C. Albrecht, "Global Routing by New Approximation Algorithms for Multicommodity Flow" , IEEE Transactions 
on Computer- Aided Design of Integrated Circuits and Systems, vol. 20(5), 2001, pp. 622-632. 

[3] C. Albrecht, A.B. Kahng, I.I. Mandoiu, and A.Z. Zelikovsky. "Floorplan evaluation with timing-driven global 
wireplanning, pin assignment, and buffer/wire sizing", Proc. 7th Asia and South Pacific Design Automation 
Conference and 15th International Conference on VLSI Design, 2002, pages 580-587. 



REFERENCES 



17 



[4] C. Alpert and J. Hu and S. Sapatnekar and P. Villarrubia, "A practical methodology for early buffer and wire 
resource allocation", Proc. DAC, 2001. 

[5] C. Alpert and J. Hu and S. Sapatnekar and P. Villarrubia, "A practical methodology for early buffer and wire 
resource allocation" , IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems, Vol. 
22(5), 2003, pp. 573-583. 

[6] C. Alport, A.B. Kahng, B. Liu, l.I. Mandoiu, and A.Z. Zclikovsky. "Minimum buffered routing with bounded 
capacitive load for slew rate and reliability control", IEEE Transactions on Computer- Aided Design, 22 (2003), 
pp. 241-253. 

[7] R.C. Garden and C.-K. Cheng, "A global router using an efficient approximate multicommodity multiterminal 
flow algorithm", Proc. DAC, 1991, pp. 316-321. 

[8] N. P. Chen, C.-P. Hsu and E. S. Kuh, "The Berkeley building-block (BBL) layout system for VLSI design", 
VLSI83, Proceedings of the IFIP TC WG 10.5 International Conference on Very Large Scale Integration, Trond- 
heim, Aug. 1983, pp. 37-44. 

[9] J. Cong, T. Kong and D.Z. Pan, "Buffer block planning for interconnect-driven floorplanning" , Proc. ICC AD, 
1999, pp. 358-363. 

"Provably good global buffering by multiterminal multicommodity flow 

[10] F.F. Dragan, A.B. Kahng, I.I. Mandoiu, S. Muddu, and A.Z. Zelikovsky. "Provably good global buffering by 
generalized multiterminal multicommodity flow approximation" , IEEE Transactions on Computer- Aided Design, 
21 (2002), pp. 263-274. 

[11] L.K. Fleischer, "Approximating fractional multicommodity flow independent of the number of commodities", 
SI AM J. Discrete Math. 13 (2000), pp. 505-520. 

[12] N. Garg and J. Konemann, "Faster and simpler algorithms for multicommodity flow and other fractional packing 
problems" , Proc. 39th Annual Symposium on Foundations of Computer Science, 1998, pp. 300-309. 

[13] J. Hu and S. Sapatnekar, "A survey on multi-net global routing for integrated circuits". Integration 31 (2001), 
pp. 1-49. 

[14] J. Huang, X.-L. Hong, C.-K. Cheng and E. S. Kuh, "An Efficient Timing Driven Global Routing Algorithm", 
Proc. DAC, 1993, pp. 596-600. 

[15] I.I. Mandoiu. Recent advances in nulticommodity flow algorithms for global routing. In Proc. 5th Intl. Conf. 

on ASIC (ASICON), pp. 160-165, 2003. 

[16] P. Raghavan and CD. Thomson, "Randomized rounding", Combinatorica, 7 (1987), pp. 365-374. 

[17] F. Shahrokhi and D. W. Matula, "The Maximum Concurrent Flow Problem", J. Assoc. Comput. Mach. 37(2), 
Apr. 1990, pp. 318 334. 

[18] X. Tang and D.F. Wong, "Planning buffer locations by network flows", Proc. ISPD, 2000. 
[19] J. Vygen, Theory of VLSI Layout, Habilitation thesis. University of Bonn, 2001. 



REFERENCES 



Table 1.1: Circuit parameters. 



Circuit 


# 2-Pin 


Grid 


Tile 


w{e) 


Avg. tiles 


U 


#Buffer 




Nets 


size 


area 




per pin 




sites 


a9c3 


1526 


30 X 30 


1.09 


52 


4.9 


6 


32780 


ac3 


409 


30 X 30 


0.49 


26 


5.0 


7 


8550 


ami33 


324 


33 X 30 


0.46 


32 


5.0 


6 


17750 


ami49 


493 


30 X 30 


0.68 


14 


4.8 


6 


11450 


aptc 


141 


30 X 33 


0.36 


13 


5.0 


7 


4200 


hc7 


1318 


30 X 30 


1.04 


28 


4.8 


6 


17780 


hp 


187 


30 X 30 


0.42 


12 


5.0 


7 


2350 


playout 


1663 


33 X 30 


0.78 


120 


4.8 


7 


37550 


xc5 


2149 


30 X 30 


0.58 


50 


5.0 


7 


19150 


xerox 


390 


30 X 30 


0.38 


40 


5.0 


7 


7000 



REFERENCES 



19 



Table 1.2: Congestion minimization results {D = oo) for the multicommodity flow algorithm with s = 0.3. 



Circuit 


Phase# 


Wire 


Buffer 


#Buffers 


Wlen 


CPU 






Congest 


Congest 






sec. 


a9c3 


1 


0.75 


0.80 


3351 


26057 


12.0 




4 


0.59 


0.43 


3356 


26123 


47.5 




16 


0.51 


0.23 


3402 


26595 


188.8 








U. lo 


OOUO 


Z ( OZO 


7'^n 7 




d4-|-11U U in JJ 


0.62 


0.30 


3625 


O TO Q C\ 

.^yyoU 


785 


ac3 


1 


0.77 


1.00 


796 


4998 


3.0 




4 


0.62 


0.53 


797 


5008 


12.2 




16 


0.40 


0.27 


803 


5072 


48.9 








n 1 >i 


OZO 


5211 


1 Q9 




a A 1 Pi^TTTVTT^ 

\3^-\-lWJ \J rH LJ 


U.4D 


U.DU 


f 






ami33 


1 


U.DD 


0.67 


909 


4466 


2.6 




4 


U.OO 


0.36 


908 


447 D 


10.6 




16 


r\ A'? 
U.47 


0.20 


910 


4olo 


42.2 






n An 


n 1 4 

U. 11 


you 




IDo.U 




fiA-UTt OT TATFi 
D^T^xVv^ U 1^ U 


U*OD 


U.O± 




AfiQSi 


xox 


ami49 


1 


1.36 


0.90 


948 


6045 


3.0 




4 


1.00 


0.46 


958 


6083 


12.3 




16 


0.74 


0.29 


1040 


6509 


52.1 






U.DD 


fl 91 


i Z D 


( Z ( o 


91 1 
Zl 1 .0 




fi/l _LT3 OT TISTT^ 
X3^-\-lX\J U IN U 


±.uu 


U.OO 


1 QnQ 


1 ( OX 




apte 


1 


1.08 


1.00 


328 


1668 


1.2 




4 


0.87 


0.57 


327 


1677 


5.0 




16 


0.53 


0.30 


336 


1725 


21.1 






44 


n 1 7 




loOD 


oD.o 




D4-|-±Vd U iM U 


l.UU 


1 nn 
l.UU 


Qcn 
oOU 


1 C /I 1 

lo4i 


nQ 

yo 


hc7 


1 


1.00 


1.19 


2203 


17670 


8.0 




4 


0.79 


0.61 


2206 


17738 


32.5 




16 


0.69 


0.31 


2301 


18481 


132.9 




64 


0.62 


0.23 


2498 


19660 


516.9 




64+ROUND 


0.89 


0.50 


2675 


20584 


562 


hp 


1 


0.92 


1.67 


334 


1952 


1.3 




4 


0.71 


0.85 


330 


1961 


5.2 




16 


0.46 


0.45 


334 


2003 


21.8 




64 


0.33 


0.29 


355 


2119 


89.8 




64+ROUND 


0.58 


1.00 


362 


2147 


101 


playout 


1 


0.64 


0.98 


2890 


23155 


14.9 




4 


0.52 


0.42 


2892 


23199 


60.6 




16 


0.40 


0.24 


2922 


23582 


257.5 




64 


0.33 


0.17 


3238 


25809 


1118.8 




64+ROUND 


0.36 


0.28 


3467 


27281 


1176 


xc5 


1 


1.14 


1.31 


3187 


22314 


17.6 




4 


0.98 


0.66 


3202 


22492 


70.9 




16 


0.74 


0.37 


3277 


23231 


288.2 




64 


0.66 


0.31 


3570 


24872 


1134.7 




64+ROUND 


0.88 


0.57 


3895 


26305 


1216 


xerox 


1 


0.93 


1.42 


659 


3662 


2.8 




4 


0.72 


0.77 


660 


3698 


11.9 




16 


0.45 


0.40 


684 


3858 


54.0 




64 


0.32 


0.21 


753 


4174 


237.6 




64+ROUND 


0.47 


0.67 


779 


4295 


257 
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Table 1.3: Wirelength minimization (a = and /3 = 1) subject to wire and buffer congestion constraints 
(/io = 1-0 and vq = 1-0). RABID is the algorithm of [4], MCF is the multicommodity flow algorithm without 
pin assignment capability, and MCF+PA is the multicommodity flow algorithm with pin assignment enabled. 
Both MCF and MCF+PA were run with s = 0.3. 



Circuit 


Algorithm 


vv laii 


/orjij 


^buffers 
















gap 




gap 


Congest 


Congest 




a9c3 


RABID 


30723 


5.64 


4225 


11.95 


0.60 


0.44 


502 




MCF+ROUND 


29082 


0.00 


3800 


0.69 


0.67 


0.31 


775 




A /T/^T7 1 T) A 1 TZ> /~lT T1\TT\ 

MOr +r^A+KO U IN D 


26057 


0.00 


3378 


0.81 


0.62 


0.30 


779 




RABID 


5954 


7.67 


1037 


15.74 


0.58 


0.33 


208 




MCF+ROUND 


5530 


0.00 


905 


1.00 


0.77 


0.67 


204 




AJOTT' 1 "DA 1 T>*^TT'\TT\ 

ML-r +FA+H(JUJND 


4993 


0.00 


803 


1.13 


0.69 


0.67 


204 


ami33 


RABID 


5232 


6.93 


1150 


14.20 


0.69 


0.44 


138 




MCF+ROUND 


4893 


0.00 


1014 


0.70 


0.75 


0.33 


177 






4404 


U.UU 






U.oo 


U.oU 


1 '7'7 
lit 


ami49 


RABID 


7592 


11.87 


1339 


21.51 


0.93 


0.36 


167 




MCF+ROUND 


6792 


0.07 


1133 


2.81 


1.00 


0.60 


227 




MCF+PA+ROUND 


6041 


0.01 


989 


4.66 


1.00 


0.44 


218 


apte 


RABID 


2010 


10.78 


417 


18.47 


1.00 


0.33 


95 




MCF+ROUND 


1833 


1.03 


377 


7.10 


1.00 


1.00 


88 




MCF+PA+ROUND 


1663 


0.15 


331 


4.75 


1.00 


1.00 


87 


hc7 


RABID 


21523 


7.54 


2983 


17.44 


0.82 


0.35 


386 




MCF+ROUND 


20024 


0.05 


2591 


2.01 


0.96 


0.47 


551 




MCF+PA+ROUND 


17660 


0.00 


2214 


0.68 


0.86 


0.47 


543 


hp 


RABID 


2403 


11.12 


450 


20.97 


0.83 


0.28 


67 




MCF+ROUND 


2165 


0.13 


404 


8.60 


1.00 


1.00 


95 




MCF+PA+ROUND 


1945 


0.00 


345 


6.81 


0.92 


1.00 


94 


playout 


RABID 


27601 


6.38 


3840 


15.04 


0.45 


0.64 


813 




MCF+ROUND 


25946 


0.00 


3429 


2.73 


0.53 


0.34 


1002 




MCF+PA+ROUND 


23138 


0.00 


3011 


4.37 


0.40 


0.32 


995 


xc5 


RABID 


27060 


8.35 


4410 


23.25 


0.84 


0.81 


694 




MCF+ROUND 


25151 


0.71 


3843 


7.41 


0.98 


0.62 


1162 




MCF+PA+ROUND 


22265 


0.05 


3341 


4.90 


1.00 


0.65 


1175 


xerox 


RABID 


4541 


11.48 


957 


30.56 


0.93 


0.57 


167 




MCF+ROUND 


4078 


0.12 


805 


9.82 


1.00 


1.00 


212 




MCF+PA+ROUND 


3658 


0.00 


692 


6.30 


0.88 


0.67 


208 



Table 1.4; Wirelength minimization results for non-inverting vs. inverting buffer insertion. The number of 
buffer sites was assumed to be the same in both experiments. 



Testcase 


Non-inverting buffers 


Inverting buffers 


Wlen 


#buffers 


W-Congest B-Congcst 


CPU 


Wlcn 


# buffers 


W-Congest 


B- Congest 


CPU 


a9c3 


29082 


3800 


0.67 


0.31 


775 


29082 


4540 


0.60 


0.41 


1470 


ac3 


5530 


905 


0.77 


0.67 


204 


5530 


1095 


0.69 


0.50 


417 


ami33 


4893 


1014 


0.75 


0.33 


177 


4893 


1186 


0.62 


0.29 


359 


ami49 


6792 


1133 


1.00 


0.60 


227 


6790 


1417 


1.00 


1.00 


449 


apte 


1833 


377 


1.00 


1.00 


88 


1833 


441 


1.00 


1.00 


185 


hc7 


20024 


2591 


0.96 


0.47 


551 


20024 


3358 


0.89 


0.50 


1030 


hp 


2165 


404 


1.00 


1.00 


95 


2164 


495 


1.00 


1.00 


201 


playout 


25946 


3429 


0.53 


0.34 


1002 


25946 


4235 


0.53 


0.32 


1982 


xc5 


25151 


3843 


0.98 


0.62 


1162 


25222 


4799 


0.96 


1.00 


2285 


xerox 


4078 


805 


1.00 


1.00 


212 


4155 


1050 


1.18 


1.00 


520 
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Table 1.5; Runtime scaling for the timing-driven version of the MCF algorithm (delay measured by number 
of inserted buffers). 



Testcase 


Dclay 


juuiici = 1 


Delay 


jouiici = 2 


Delay 


jouiici = 1 


Delay 


jouiicl = 8 


No delay 


bound 


#nets 


CPU 


#nets 


CPU 


#nets 


CPU 


#nets 


CPU 


#nets 


CPU 


a9c3 


455 


77 


820 


361 


1361 


2178 


1526 


7667 


1526 


775 


acS 


152 


29 


249 


122 


374 


666 


409 


2270 


409 


204 


ami33 


63 


13 


125 


50 


260 


413 


323 


1761 


324 


177 


ami49 


177 


25 


298 


113 


442 


598 


493 


2161 


493 


227 


aptc 


49 


12 


67 


33 


126 


255 


141 


968 


141 


88 


hc7 


569 


70 


873 


305 


1231 


1584 


1318 


5217 


1318 


551 


hp 


76 


15 


124 


63 


174 


330 


187 


1083 


187 


95 


playout 


657 


124 


1095 


651 


1575 


3506 


1663 


10979 


1663 


1002 


xc5 


1072 


192 


1429 


748 


2100 


4158 


2149 


12351 


2149 


1162 


xerox 


163 


29 


282 


201 


360 


752 


390 


2308 


390 


212 
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Figure 1.1: Tile graph with two 2-pin nets. 




H» Cap=w(u,v) v" 



Figure 1.2: The basic gadget replacing edge {u,v) of the tile graph for buffer wireload upperbound U = 5. 



(1) Set :=^yve V{G), := ^ Ve 6 E{G), u := % 

(2) Set Xp:=0\/peV 

(3) Set r = and p^ := for i = 1, k. 

(4) While /io K'")yv + X] """("i + -Dw < 1 do: 

t>ev(G) (u,i;)eE(G) 

(5) begin 

(6) r := r+l. 

(7) For i := 1 to k, do 

(8) begin 

(9) lfp, = 0or \Vi^E^\{yv + Ciu)+ T, |pi H (2„,„ + /3ti) > (1+ 7c) then 

v€V{G) {u,v)€E(G) 

(10) begin 

(11) Find a path Pi e Pi minimizing /i := Yl \Pi {y^ + au) + Yl \Pi <^ Eu,v\ {zu,v + Pu) 

v€V(G) {u,v)€E(G) 

(12) end 

(13) Set Xp^ := Xp^ + 1 

(14) Sety^:=y^ ll + eil^)yveViG), := (^1 + e^^^'^ \/{u,v) e E{G) 

a ^ lpinE„| + /3 ^ |PinEu_.„|\ 
«€V(G) (u,t.)€E(G) I 

(15) end 

(16) end 

(17) Output {xp/r)^^j, 




Figure 1.3: Algorithm for approximately solving LP p.2|) . 
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Pi 
(a) 
• ^ 



S3 62 



Pi 
(b) 



-o 



S3 P2 



Pi 
(c) 



-O 



S3 P2 62 

o — I — < 



6 s. 



(d) 



-O 



Figure 1.4: (a) Minimum buffered routing T of a 4-pin net with source s and sinks si, 82,83 pro- 
duced by the algorithm in [6]. T has 2 buffers 61 and 62 and two Steiner points pi and P2- (b) 

Decomposition of T using fixed branching and fixed buffering. There are 7 resulting 2-pin nets: 
(s, 61), (pi, si), (pi, 62), (fo2,P2), (P2, S2), and (^3,33). (c) Decomposition of T using fixed branch- 

ing and flexible buffering. There are 5 resulting 2-pin nets: (s,pi), (pi, si), (pi,p2), (P2, S2), and (^3,53). 
(d) Decomposition of T using half-flexible branching and flexible buffering. The Steiner point pi is fixed 
and, therefore, split between 3 nets, while the Steiner point p2 is flexible. There are 2 resulting 2-pin nets: 
{s,Pi), (pi, si) and one 3-pin net (pi, S2, S3). 



(1) Set w* := 00 

(2) For allveV do 

(3) begin 

(4) For j := to {/ 

(5) begin 

(6) Find a shortest v'^~ 

(7) For k--OtoU-j 

(8) begin 

(9) Find a siiortest 1;^"* - tf-patli P2 in H 

(10) Find a siiortest s° - ■y'^"-'"'"-patli Pq in H 

(11) Ifii)(Po)-|-«;(Pi)-|-«;(P2) < w* tlien 

(12) Set w* ■- w{Po) -h w{Pi) -h w{P2); T* := Po U Pi U P2 

(13) end 

(14) end 

(15) end 

(16) return T* 



// try all possible Steiner points 



tj-path Pi in H 



Figure 1.5: Algorithm for finding minimum weight buffered routings for 3-pin nets. 





Figure 1.6: Gadget for polarity constraints with buffer load upperbound U = 2. 
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Figure 1.7: (a) Gadget for buffer sizing with two available buffer sizes, one with wireload upperbound 

U = 4 and one with wireload upperbound U = 2. Solid arcs (m*, respectively correspond to 

the insertion of a buffer capable of driving 4 units of wire, while dashed arcs (m% u^) and (t)', v"^) correspond 
to the insertion of a smaller buffer capable of driving 2 units of wire, (b) Gadget for wire sizing with two 
available wire widths, standard width and "half" width (i.e., wire with double per unit capacitive load). The 
solid arcs {u^,v^~^) and {v^,u^~^) correspond to standard width connections between tiles u and v, while 
dashed arcs and correspond to half-width connections. 




Figure 1.8: Gadget for enforcing delay constraints when the delay is measured by the number of buffers 
inserted between source and sink. The basic gadget is replicated a number of times equal to the maximum 
allowed net delay (3 in this example). Tile- to- tile arcs decrease remaining wireload budget within a gadget 
replica, while buffer arcs advance from one replica to the next. 



